The Sample Complexity of Dictionary Learning 



Daniel Vainsencher Shie Mannor 

danielv@tx.technion.ac.il shie@ee.technion.ac.il

Department of Electrical Engineering Department of Electrical Engineering 

Technion, Israel Institute of Technology Technion, Israel Institute of Technology 

Haifa 32000, Israel Haifa 32000, Israel

Alfred M. Bruckstein 
freddy@cs.technion.ac.il
Department of Computer Science 
Technion, Israel Institute of Technology 
Haifa 32000, Israel 

November 25, 2010 



Abstract 

A large set of signals can sometimes be described sparsely using a dictionary, that is, every
element can be represented as a linear combination of few elements from the dictionary.
Algorithms for various signal processing applications, including classification, denoising and
signal separation, learn a dictionary from a set of signals to be represented. Can we expect
that the representation found by such a dictionary for a previously unseen example from the
same source will have $L_2$ error of the same magnitude as those for the given examples? We
assume signals are generated from a fixed distribution, and study this question from a statistical
learning theory perspective.

We develop generalization bounds on the quality of the learned dictionary for two types of
constraints on the coefficient selection, as measured by the expected error in representation
when the dictionary is used. For the case of $\ell_1$ regularized coefficient selection we provide a
generalization bound of the order of $O\left(\sqrt{np\log(m\lambda)/m}\right)$, where $n$ is the
dimension, $p$ is the number of elements in the dictionary, $\lambda$ is a bound on the $\ell_1$ norm
of the coefficient vector and $m$ is the number of samples, which complements existing results. For
the case of representing a new signal as a combination of at most $k$ dictionary elements, we provide
a bound of the order $O\left(\sqrt{np\log(mk)/m}\right)$ under an assumption on the level of
orthogonality of the dictionary (low Babel function). We further show that this assumption holds for
most dictionaries in high dimensions in a strong probabilistic sense. Our results further yield fast
rates of order $1/m$ as opposed to $1/\sqrt{m}$ using localized Rademacher complexity. We provide
similar results in a general setting using kernels with weak smoothness requirements.

1 Introduction 

In processing signals from $\mathcal{X} = \mathbb{R}^n$ it is now a common technique to use sparse
representations; that is, to approximate each signal $x$ by a "small" linear combination $a$ of
elements $d_i$ from a dictionary $D \in \mathcal{X}^p$, so that
$x \approx Da = \sum_{i=1}^{p} a_i d_i$. This has various uses detailed in Section 1.1. The smallness
of $a$ is often measured using either $\|a\|_1$ or the number of non zero elements in $a$, often
denoted $\|a\|_0$. The approximation error is measured here using a Euclidean norm appropriate to the
vector space. We denote the approximation error of $x$ using dictionary $D$ and coefficients from $A$ as

$$h_{A,D}(x) = \min_{a \in A} \|Da - x\|, \qquad (1.1)$$

where A is one of the following sets determining the sparsity required of the representation: 

$$H_k = \{a : \|a\|_0 \le k\}$$
induces a "hard" sparsity constraint, which we also call k sparse representation, while

$$R_\lambda = \{a : \|a\|_1 \le \lambda\}$$

induces a convex constraint that is a "relaxation" of the previous constraint. 
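To make the hard constraint concrete, the following is a minimal sketch that evaluates the k sparse representation error by exhaustive search over supports. It assumes the dictionary is given as a plain list of atom vectors (the columns of $D$); the helper names `dot`, `residual_norm` and `k_sparse_error` are illustrative and do not come from the paper. The search is exponential in $k$, so it is only meant for tiny examples.

```python
import itertools
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def residual_norm(x, atoms):
    """Distance from x to the span of `atoms`, via Gram-Schmidt projection."""
    basis = []
    for d in atoms:
        # Orthogonalize the atom against the orthonormal basis built so far.
        w = list(d)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip atoms that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    # Remove from x its component along every basis vector.
    r = list(x)
    for b in basis:
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return math.sqrt(dot(r, r))


def k_sparse_error(x, dictionary, k):
    """Brute-force h_{H_k,D}(x): best error over all supports of size at most k."""
    best = math.sqrt(dot(x, x))  # the zero coefficient vector is always admissible
    for size in range(1, k + 1):
        for support in itertools.combinations(dictionary, size):
            best = min(best, residual_norm(x, support))
    return best


# Two orthogonal atoms in R^3 and a unit signal lying in their span.
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [0.6, 0.8, 0.0]
print(k_sparse_error(x, D, 1))  # approximately 0.6 (best single atom is the second)
print(k_sparse_error(x, D, 2))  # approximately 0.0 (x is in the span of both atoms)
```

The relaxed set $R_\lambda$ would instead call for a convex solver, which is omitted here.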
The dictionary learning problem is to find a dictionary D minimizing 



$$E(D) = \mathbb{E}_{x \sim \nu}\, h_{A,D}(x), \qquad (1.2)$$

where $\nu$ is a distribution over signals that is known to us only through samples from it. The
problem addressed in this paper is the "generalization" (in the statistical learning sense) of
dictionary learning: to what extent does the performance of a dictionary chosen based on a finite set
of samples indicate its expected error in (1.2)? This clearly depends on the number of samples and
other parameters of the problem such as dictionary size. In particular, an obvious algorithm is to
represent each sample using itself, if the dictionary is allowed to be as large as the sample, but the
performance on unseen signals is likely to disappoint.

To state our goal more quantitatively, assume that an algorithm finds a dictionary $D$ suited to k
sparse representation, in the sense that the average representation error $E_m(D)$ on the $m$ examples
it is given is low. Our goal is to bound the generalization error $\varepsilon$, which is the additional
expected error that might be incurred:

$$E(D) \le (1 + \eta) E_m(D) + \varepsilon,$$

where $\eta \ge 0$ is sometimes zero, and the bound depends on the number of samples and problem
parameters. Since algorithms that find the optimal dictionary for a given set of samples (also known
as empirical risk minimization, or ERM, algorithms) are not known for dictionary learning, we
prove uniform convergence bounds that apply simultaneously over all admissible dictionaries $D$,
thus bounding from above the sample complexity of the dictionary learning problem.

Many analytic and algorithmic methods relying on the properties of finite dimensional Euclidean
geometry can be applied in more general settings by applying kernel methods. These consist of
treating objects that are not naturally represented in $\mathbb{R}^n$ as having their similarity
described by an inner product in an abstract feature space that is Euclidean. This allows the
application of algorithms depending on the data only through a computation of inner products to such
diverse objects as graphs, DNA sequences and text documents, that are not naturally represented using
vector spaces (Shawe-Taylor and Cristianini, 2004). Is it possible to extend the usefulness of
dictionary learning techniques to this setting? We address sample complexity aspects of this question
as well.

1.1 Background and related work



Sparse representations are a standard practice in diverse fields such as signal processing, natural
language processing, etc. Typically, the dictionary is assumed to be known. The motivation for
sparse representations is indicated by the following results, in which we assume the signals come
from $\mathcal{X} = \mathbb{R}^n$, and the representation coefficients from $A = H_k$ where
$k < n, p$ and typically $h_{A,D}(x) \ll 1$.



Compression: If a signal $x$ has an approximate sparse representation in some commonly
known dictionary $D$, then by definition, storing or transmitting the sparse representation will
not cause large error.

Representation: If a signal $x$ has an approximate sparse representation in a dictionary $D$ that
fulfills certain geometric conditions, then its sparse representation is unique and can be found
efficiently (Bruckstein et al., 2009).

Denoising: If a signal $x$ has a sparse representation in some known dictionary $D$, and
$\tilde{x} = x + \nu$, where the random noise $\nu$ is Gaussian, then the sparse representation found
for $\tilde{x}$ will likely be very close to $x$ (for example Chen et al., 2001).

Compressed sensing: Assuming that a signal $x$ has a sparse representation in some known
dictionary $D$ that fulfills certain geometric conditions, this representation can be approximately
retrieved with high probability from a small number of random linear measurements of $x$.
The number of measurements needed depends on the sparsity of $x$ in $D$ (Candes and Tao, 2006).



The implications of these results are significant when a dictionary $D$ is known that sparsely
represents simultaneously many signals. In some applications the dictionary is chosen based on
prior knowledge, but in many applications the dictionary is learned based on a finite set of
examples. To motivate dictionary learning, consider an image representation used for compression
or denoising. Different types of images may have different properties (MRI images are not
similar to scenery images), so that learning a specific dictionary for each type of image may lead to
improved performance. The benefits of dictionary learning have been demonstrated in many
applications (Protter and Elad, 2007; Peyré, 2009; Yang et al., 2009).

Two extensively used techniques related to dictionary learning are Principal Component
Analysis (PCA) and k means clustering. The former finds a single subspace minimizing the sum of
squared representation errors, which is very similar to dictionary learning with $A = \mathbb{R}^k$
and $p = k$. The latter finds a set of locations minimizing the sum of squared distances between each
signal and the location closest to it, which is very similar to dictionary learning with $A = H_1$
where $p$ is the number of locations. Thus we could see dictionary learning as PCA with multiple
subspaces, or as clustering where multiple locations are used to represent each signal. The sample
complexity of both algorithms is well studied (Bartlett et al., 1998; Biau et al., 2008;
Shawe-Taylor et al., 2005; Blanchard et al., 2007).

This paper does not address questions of computational cost, though they are very relevant.
Finding optimal coefficients for k sparse representation (that is, minimizing (1.1) with $A = H_k$)
is NP-hard in general (Davis et al., 1997). Dictionary learning as an optimization problem, that
of minimizing (1.2), is less well understood, even for empirical $\nu$ (consisting of a finite
number of samples), despite over a decade of work on related algorithms with good empirical
results (Olshausen and Field, 1997; Lewicki et al., 1998; Kreutz-Delgado et al., 2003;
Aharon et al., 2005; Lee et al., 2007; Krause and Cevher, 2010; Mairal et al., 2010).



The only prior work we are aware of that addresses generalization in dictionary learning, by
Maurer and Pontil (2010), addresses the convex representation constraint $A = R_\lambda$; we discuss
the relation of our work to theirs in Section 2. Another related work studies the identifiability
of dictionaries, giving conditions under which a dictionary may be exactly recovered. A recent
example giving somewhat similar requirements on the number of samples (though in a different
setting, and to obtain a different kind of result) is by Gribonval and Schnass (2009), which also
includes a review of identifiability results.



2 Results 



Except where we state otherwise, we assume signals are generated in the unit sphere $S^{n-1}$.

A new approach to dictionary learning generalization. Our first main contribution is an
approach to generalization bounds in dictionary learning that is complementary to that used by
Maurer and Pontil (2010). Assume that the columns of the dictionary $D \in \mathbb{R}^{n \times p}$
are of unit length, and that each signal $x \in S^{n-1}$ is approximately represented in the form $Da$
where the coefficient vector $a$ is known to fulfill a constraint of form $\|a\|_1 \le \lambda$. We
quantify the complexity of the associated error function class in terms of $\lambda$, so that standard
methods of uniform convergence give generalization error bounds $\varepsilon$ of order
$O\left(\sqrt{np\log(m\lambda)/m}\right)$ with $\eta = 0$. The method by Maurer and Pontil results in
Theorem 3, given below, providing generalization error bounds of order

$$O\left(\sqrt{p \min(p, n)\left(\lambda + \sqrt{\log(m\lambda)}\right)^2 / m}\right).$$

Thus the latter are applicable to the case $n \gg p$, while our approach is not. However in the
case $n < p$, also known in the literature as the "over-complete" case (Olshausen and Field, 1997;
Lewicki et al., 1998), the important complexity parameter is $\lambda$, on which our bounds depend only
logarithmically, instead of polynomially. One case where this is significant is where the
representation is chosen by solving a minimization problem such as
$\min_a \|Da - x\| + \gamma \cdot \|a\|_1$, in which $\lambda = O(\gamma^{-1})$.

Fast rates. For the case $\eta > 0$ our methods are compatible with general fast rate methods
of Bartlett et al. (2005), for bounds of order $O(np\log(\lambda m)/m)$. The main significance of this
is not in the numerical results achieved, due to the large constants, but in that the general
statistical behavior they imply occurs in dictionary learning. For example, generalization error has a
"proportional" component which is reduced when the empirical error is low. Whether fast rates results
can be proved under the infinite dimension regime is an interesting question we leave open. Note that
due to lower bounds by Bartlett et al. (1998) of order $\sqrt{m^{-1}}$ on the k-means clustering
problem, which corresponds to dictionary learning for 1-sparse representation, fast rates may be
expected only with $\eta > 0$, as presented here.

We now describe the relevant function class and the bounds on its complexity, which are proved
in Section 3. The resulting generalization bounds are given explicitly at the end of this section.

Theorem 1. The function class
$\mathcal{G}_\lambda = \{h_{R_\lambda,D} : S^{n-1} \to \mathbb{R} : D \in \mathbb{R}^{n \times p}, \|d_i\|_2 \le 1\}$,
taken as a metric space with the metric induced by $\|\cdot\|_\infty$, has an $\varepsilon$ cover of
cardinality at most $(4\lambda/\varepsilon)^{np}$.

Extension to k sparse representation. Our second main contribution is to extend both our
approach and that of Maurer and Pontil to provide generalization bounds for dictionaries for
k sparse representations, by using a bound $\lambda$ on the $\ell_1$ norm of the representation
coefficients when the dictionaries are close to orthogonal. Distance from orthogonality is measured by
the Babel function, defined below and discussed in more detail in Section 4.



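Definition 1 (Babel function, Tropp, 2004). For any $k \in \mathbb{N}$, the Babel function
$\mu_k : \mathbb{R}^{n \times p} \to \mathbb{R}^+$ is defined by:

$$\mu_k(D) = \max_{\Lambda \subset \{1,\ldots,p\};\ |\Lambda| = k}\ \max_{i \notin \Lambda}\ \sum_{\lambda \in \Lambda} |\langle d_\lambda, d_i \rangle|.$$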
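The definition can be evaluated directly, since for a fixed atom $d_i$ the maximizing index set consists of the $k$ other atoms with the largest absolute inner products with $d_i$. The following is a minimal sketch under the assumption that the dictionary is stored as a list of atom vectors; the function name `babel` is illustrative and not taken from the paper.

```python
def babel(dictionary, k):
    """Babel function mu_k(D) for a list of atoms (the columns of D)."""
    p = len(dictionary)
    if not 0 <= k < p:
        raise ValueError("k must satisfy 0 <= k < p")
    worst = 0.0
    for i, d_i in enumerate(dictionary):
        # Absolute inner products of atom i with every other atom, largest first.
        overlaps = sorted(
            (abs(sum(a * b for a, b in zip(d_i, d_j)))
             for j, d_j in enumerate(dictionary) if j != i),
            reverse=True,
        )
        # The maximizing index set for atom i is the k largest overlaps.
        worst = max(worst, sum(overlaps[:k]))
    return worst


# Three unit atoms in the plane: e_1, e_2 and their normalized sum.
s = 0.5 ** 0.5
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]
print(babel(atoms, 1))  # about 0.7071: the coherence
print(babel(atoms, 2))  # about 1.4142: the third atom overlaps with both others
```

Sorting makes the cost $O(p^2 (n + \log p))$, which avoids enumerating all index sets of size $k$.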

The following proposition, which is proved in Section 3, bounds the $\ell_1$ norm of the
dictionary coefficients for a k sparse representation and also follows from analysis previously done
by Donoho and Elad (2003); Tropp (2004).

Proposition 1. Let $\|d_i\|_2 \in [1, \gamma]$ for each column $d_i$ of $D$ and $\mu_{k-1}(D) < 1$,
then a coefficient vector $a \in \mathbb{R}^p$ minimizing the k-sparse representation error
$h_{H_k,D}(x)$ exists which has $\|a\|_1 \le \gamma k / (1 - \mu_{k-1}(D))$.

We now consider the class of all k sparse representation error functions. We prove in Section 3
the following bound on the complexity of this class.

Corollary 2. The function class
$\mathcal{F}_{\delta,k} = \{h_{H_k,D} : S^{n-1} \to \mathbb{R} : \mu_{k-1}(D) \le \delta\}$, taken as a
metric space with the metric induced by $\|\cdot\|_\infty$, has an $\varepsilon$ cover of cardinality
at most $(4k/(\varepsilon(1 - \delta)))^{np}$.

The dependence of the last two results on $\mu_{k-1}(D)$ means that the resulting bounds will be
meaningful only for algorithms which explicitly or implicitly prefer near orthogonal dictionaries.
Contrast this to Theorem 1, which has no significant conditions on the dictionary.

Asymptotically almost all dictionaries are near orthogonal. A question that arises is what
values of $\mu_{k-1}$ can be expected for parameters $n, p, k$? We discuss this question and prove the
following probabilistic result in Section 4.

Theorem 2. Suppose that $D$ consists of $p$ vectors chosen uniformly and independently from $S^{n-1}$.
Then we have

$$P\left(\mu_k > \frac{1}{2}\right) \le \frac{1}{e^{(n-2)/(10k\log p)^2} - 1}.$$



Since low values of the Babel function have implications for representation finding algorithms,
this result is of interest also outside the context of dictionary learning. Essentially it means that
random dictionaries of size sub-exponential in $(n - 2)/k^2$ have low Babel function.
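To get a feel for when the tail bound of Theorem 2 becomes informative, the sketch below evaluates its right hand side, read here as $1/(e^{(n-2)/(10k\log p)^2} - 1)$ with natural logarithm. The function name `theorem2_rhs` is an illustrative assumption, and the numbers only describe the bound, not the actual probability.

```python
import math


def theorem2_rhs(n, k, p):
    """Right hand side 1 / (exp((n - 2) / (10 k log p)^2) - 1) of Theorem 2."""
    if p < 2 or k < 1:
        raise ValueError("need p >= 2 and k >= 1")
    exponent = (n - 2) / (10 * k * math.log(p)) ** 2
    if exponent <= 0:
        return math.inf  # the bound carries no information
    return 1.0 / math.expm1(exponent)  # expm1(t) = exp(t) - 1, accurate for small t


# The bound is vacuous (above 1) in moderate dimension and small in high dimension.
print(theorem2_rhs(1000, 1, 100))   # larger than 1
print(theorem2_rhs(10000, 1, 100))  # below 0.01
```

Because $k$ and $\log p$ enter squared in the exponent, doubling $k$ requires roughly four times the dimension to keep the same guarantee.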

New generalization bounds for $\ell_1$ case. The covering number bound of Theorem 1 implies
several generalization bounds for the problem of dictionary learning for $\ell_1$ regularized
representation which we give here. These differ from those by Maurer and Pontil (2010) in depending
more strongly on the dimension of the space, but less strongly on the particular regularization term.
We first give the relevant specialization of the result by Maurer and Pontil (2010) for comparison and
for reference as we will later build on it. This result is independent of the dimension $n$ of the
underlying space, thus the Euclidean unit ball $B$ may be that of a general Hilbert space, and the
errors measured by $h_{A,D}$ are in the same norm.

Theorem 3 (Maurer and Pontil, 2010). Let $\max_{a \in A} \|a\|_1 \le \lambda$, and $\nu$ be any
distribution on the unit sphere $B$. Then with probability at least $1 - e^{-x}$ over the $m$ samples
in $E_m$ drawn according to $\nu$, for all dictionaries $D \subset B$ with cardinality $p$:

$$E h^2_{A,D} \le E_m h^2_{A,D} + \sqrt{\frac{p^2\left(14\lambda + \frac{1}{2}\sqrt{\ln(16m\lambda^2)}\right)^2}{m}} + \sqrt{\frac{x}{2m}}.$$



Using the covering number bound of Theorem 1 and a bounded differences concentration
inequality (see Lemma 9), we obtain the following result. The details are given in Section 3.

Theorem 4. Let $\lambda > 0$, with $\nu$ a distribution on $S^{n-1}$. Then with probability at least
$1 - e^{-x}$ over the $m$ samples in $E_m$ drawn according to $\nu$, for all $D$ with unit length
columns:

$$E h_{R_\lambda,D} \le E_m h_{R_\lambda,D} + \sqrt{\frac{np\ln(4\sqrt{m}\lambda)}{2m}} + \sqrt{\frac{x}{2m}} + \sqrt{\frac{4}{m}}.$$
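As a rough numerical illustration of the sample sizes involved, the sketch below evaluates the gap of Theorem 4, read as $\sqrt{np\ln(4\sqrt{m}\lambda)/(2m)} + \sqrt{x/(2m)} + \sqrt{4/m}$ (the form obtained by plugging the cover of Theorem 1 into Lemma 9). The function name `theorem4_gap` and the chosen parameter values are assumptions made for the example only.

```python
import math


def theorem4_gap(n, p, lam, m, x):
    """Generalization gap E h - E_m h allowed by Theorem 4 (as read above)."""
    cover_term = math.sqrt(n * p * math.log(4 * math.sqrt(m) * lam) / (2 * m))
    confidence_term = math.sqrt(x / (2 * m))  # failure probability is exp(-x)
    discretization_term = math.sqrt(4 / m)    # cost of the 1/sqrt(m) cover
    return cover_term + confidence_term + discretization_term


# 8x8 patches (n = 64), a dictionary with p = 128 atoms, lambda = 10, x = 3.
for m in (10**4, 10**6, 10**8):
    print(m, theorem4_gap(64, 128, 10.0, m, 3.0))
```

Since errors lie in $[0, 1]$, the bound only says something once the gap drops below 1; in this example that requires $m$ well above $np = 8192$.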

Using the same covering number bound and localized Rademacher complexity (see Lemma 10),
we obtain the following fast rates result.

Theorem 5. Let $\lambda > 0$, $K > 1$, $\alpha > 0$, with $\nu$ a distribution on $S^{n-1}$. Then with
probability at least $1 - e^{-x}$ over the $m$ samples in $E_m$ drawn according to $\nu$, for all $D$
with unit length columns:

$$E h_{R_\lambda,D} \le \frac{K}{K-1} E_m h_{R_\lambda,D} + 6K \max\left\{\frac{8\alpha\lambda^2}{m},\ (480)^2 \frac{(np+1)\log\frac{m}{\alpha}}{m},\ \frac{20 + 22\log(m)}{m}\right\} + \frac{11x + 5K}{m}.$$

In any particular case, $\alpha$ and then $K$ may be chosen so as to minimize the right hand side.

Generalization bounds for k sparse representation. Proposition 1 and Corollary 2 imply
certain generalization bounds for the problem of dictionary learning for k sparse representation,
which we give here.

A straightforward combination of Theorem 2 of Maurer and Pontil (2010) (given here as
Theorem 3) and Proposition 1 results in the following theorem.

Theorem 6. Let $\delta < 1$ with $\nu$ a distribution on $S^{n-1}$. Then with probability at least
$1 - e^{-x}$ over the $m$ samples in $E_m$ drawn according to $\nu$, for all $D$ s.t.
$\mu_{k-1}(D) \le \delta$:

$$E h^2_{H_k,D} \le E_m h^2_{H_k,D} + \frac{p}{\sqrt{m}}\left(\frac{14k}{1-\delta} + \frac{1}{2}\sqrt{\ln\left(16m\left(\frac{k}{1-\delta}\right)^2\right)}\right) + \sqrt{\frac{x}{2m}}.$$

In the case of clustering we have $k = 1$ and $\delta = 0$, and this result approaches the rates
of Biau et al. (2008).

The following theorems follow from standard results and the covering number bound of
Corollary 2.

Theorem 7. Let $\delta < 1$ with $\nu$ a distribution on $S^{n-1}$. Then with probability at least
$1 - e^{-x}$ over the $m$ samples in $E_m$ drawn according to $\nu$, for all $D$ s.t.
$\mu_{k-1}(D) \le \delta$:

$$E h_{H_k,D} \le E_m h_{H_k,D} + \sqrt{\frac{np\ln\frac{4\sqrt{m}k}{1-\delta}}{2m}} + \sqrt{\frac{x}{2m}} + \sqrt{\frac{4}{m}}.$$

Theorem 8. Let $\delta < 1 < K$, $\alpha > 0$ with $\nu$ a distribution on $S^{n-1}$. Then with
probability at least $1 - e^{-x}$ over the $m$ samples in $E_m$ drawn according to $\nu$, for all $D$
s.t. $\mu_{k-1}(D) \le \delta$:

$$E h_{H_k,D} \le \frac{K}{K-1} E_m h_{H_k,D} + 6K \max\left\{\frac{8\alpha k^2}{m(1-\delta)^2},\ (480)^2 \frac{(np+1)\log\frac{m}{\alpha}}{m},\ \frac{20 + 22\log(m)}{m}\right\} + \frac{11x + 5K}{m}.$$

In any particular case, $\alpha$ and then $K$ may be chosen so as to minimize the right hand side.

Generalization bounds for dictionary learning in feature spaces. We further consider
applications of dictionary learning to signals that are not represented as elements in a vector space,
or that have a very high (possibly infinite) dimension.

In addition to providing an approximate reconstruction of signals, sparse representation can also
be considered as a form of analysis, if we treat the choice of non zero coefficients and their
magnitude as features of the signal. In the domain of images, this has been used to perform
classification (in particular, face recognition) by Wright et al. (2008). Such analysis does not
require that the data itself be represented in $\mathbb{R}^n$ (or in any vector space); it is enough
that the similarity between data elements is induced from an inner product in a feature space. This
requirement is fulfilled by using an appropriate kernel function.

Definition 3. Let $\mathcal{R}$ be a set of data representations, and let the kernel function
$\kappa : \mathcal{R}^2 \to \mathbb{R}$ and the feature mapping $\phi : \mathcal{R} \to \mathcal{H}$
be such that:

$$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$$

where $\mathcal{H}$ is some Hilbert space.

As a concrete example, choose a sequence of $n$ words, and let $\phi$ map a document to the vector of
counts of appearances of each word in it (also called bag of words). Treating
$\kappa(a, b) = \langle \phi(a), \phi(b) \rangle$ as the similarity between documents $a$ and $b$ is
the well known "bag of words" approach, applicable to many document related tasks (Shawe-Taylor and
Cristianini, 2004). Then the statement $\phi(a) + \phi(b) \approx \phi(c)$ does not imply that $c$ can
be reconstructed from $a$ and $b$, but we might consider it indicative of the content of $c$. The
dictionary of elements used for representation could be decided via dictionary learning, and it is
natural to choose the dictionary so that the bags of words of documents are approximated well by small
linear combinations of those in the dictionary.
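The bag of words example can be sketched in a few lines. The vocabulary, the documents and the function names `bag_of_words` and `kernel` below are made up for illustration; tokenization by lowercasing and whitespace splitting is a simplifying assumption.

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Feature map phi: counts of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


def kernel(a, b, vocabulary):
    """kappa(a, b) = <phi(a), phi(b)>, the bag of words similarity."""
    phi_a = bag_of_words(a, vocabulary)
    phi_b = bag_of_words(b, vocabulary)
    return sum(u * v for u, v in zip(phi_a, phi_b))


vocabulary = ["sparse", "dictionary", "signal"]
a = "sparse dictionary"
b = "signal dictionary"
c = "sparse signal dictionary dictionary"

# phi(a) + phi(b) equals phi(c) here, although c is not a concatenation of a and b.
print(bag_of_words(a, vocabulary))  # [1, 1, 0]
print(bag_of_words(b, vocabulary))  # [0, 1, 1]
print(bag_of_words(c, vocabulary))  # [1, 2, 1]
print(kernel(a, b, vocabulary))     # 1
```

The feature vectors agree exactly in this toy case; with real documents one would only expect the approximate relation described in the text.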

As the example above suggests, the kernel dictionary learning problem is to find a dictionary $D$
minimizing

$$\mathbb{E}_{x \sim \nu}\, h_{\phi,A,D}(x),$$

where we consider the representation error function

$$h_{\phi,A,D}(x) = \min_{a \in A} \|(\Phi D)a - \phi(x)\|_{\mathcal{H}},$$

in which $\Phi$ acts as $\phi$ on the elements of $D$, $A \in \{R_\lambda, H_k\}$, and the norm
$\|\cdot\|_{\mathcal{H}}$ is that induced by the kernel on the feature space $\mathcal{H}$.

Analogues of all the generalization bounds mentioned so far can be replicated in the kernel
setting. The dimension free results of Maurer and Pontil (2010) apply most naturally in this setting,
and may be combined with our results to cover also dictionaries for k sparse representation, under
reasonable assumptions on the kernel.

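Proposition 2. Let $\nu$ be any distribution on $\mathcal{R}$ such that when $x \sim \nu$ we have
$\|\phi(x)\|_{\mathcal{H}} \le 1$ with probability 1. Then with probability at least $1 - e^{-x}$ over
the $m$ samples in $E_m$ drawn according to $\nu$, for all $D \subset \mathcal{R}$ with cardinality $p$
such that $\Phi D \subset B_{\mathcal{H}}$ and $\mu_{k-1}(\Phi D) \le \delta < 1$:

$$E h^2_{\phi,H_k,D} \le E_m h^2_{\phi,H_k,D} + \sqrt{\frac{p^2\left(14k/(1-\delta) + \frac{1}{2}\sqrt{\ln\left(16m\left(\frac{k}{1-\delta}\right)^2\right)}\right)^2}{m}} + \sqrt{\frac{x}{2m}}.$$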



Note that the Babel function is defined in terms of inner products between elements of $D$, and
can therefore be computed in $\mathcal{H}$ by applications of the kernel.
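A minimal sketch of this observation: every inner product $\langle \phi(d_i), \phi(d_j) \rangle$ in the Babel function is replaced by a kernel evaluation $\kappa(d_i, d_j)$, so the feature map is never formed explicitly. The homogeneous quadratic kernel and the names `quadratic_kernel` and `kernel_babel` are illustrative assumptions, not part of the paper.

```python
def quadratic_kernel(x, y):
    """Example kernel kappa(x, y) = <x, y>^2 (an illustrative choice)."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def kernel_babel(atoms, k, kernel):
    """Babel function mu_k of the mapped dictionary, using only kernel calls."""
    p = len(atoms)
    if not 0 <= k < p:
        raise ValueError("k must satisfy 0 <= k < p")
    worst = 0.0
    for i in range(p):
        # |<phi(d_i), phi(d_j)>| = |kappa(d_i, d_j)| for every other atom j.
        overlaps = sorted(
            (abs(kernel(atoms[i], atoms[j])) for j in range(p) if j != i),
            reverse=True,
        )
        worst = max(worst, sum(overlaps[:k]))
    return worst


# Unit vectors in the plane; under this kernel their images also have unit norm.
s = 0.5 ** 0.5
atoms = [(1.0, 0.0), (0.0, 1.0), (s, s)]
print(kernel_babel(atoms, 1, quadratic_kernel))  # about 0.5
print(kernel_babel(atoms, 2, quadratic_kernel))  # about 1.0
```

In this example the kernel squares the overlaps, so the mapped dictionary is closer to orthogonal than the original one.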

This result is proved in Section 5, as well as the cover number bounds (using some additional
definitions and assumptions described there) that are used to prove the remaining generalization
bounds, of which one is given below.

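Theorem 9. Let $\mathcal{R}$ have $\varepsilon$ covers of order $(C/\varepsilon)^n$. Let
$\kappa : \mathcal{R}^2 \to \mathbb{R}^+$ be a kernel function s.t.
$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$, for $\phi$ which is uniformly $L$-Hölder of order
$\alpha > 0$ over $\mathcal{R}$, and let $\gamma = \max_{x \in \mathcal{R}} \|\phi(x)\|_{\mathcal{H}}$.
Let $\delta < 1$, and $\nu$ any distribution on $\mathcal{R}$, then with probability at least
$1 - e^{-x}$ over the $m$ samples in $E_m$ drawn according to $\nu$, for all dictionaries
$D \subset \mathcal{R}$ of cardinality $p$ s.t. $\mu_{k-1}(\Phi D) \le \delta < 1$ (where $\Phi$ acts
like $\phi$ on columns):

$$E h_{H_k,D} \le E_m h_{H_k,D} + \gamma\left(\sqrt{\frac{np\ln(\ldots)}{2\alpha m}} + \sqrt{\frac{x}{2m}}\right) + \ldots$$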



The covering number bounds needed to prove this theorem and analogs for the other
generalization bounds are proved in Section 5.



3 Covering numbers of $\mathcal{G}_\lambda$ and $\mathcal{F}_{\delta,k}$



The main content of this section is the proof of Theorem 1 and Corollary 2. We also show that
the restriction of near-orthogonality on the set of dictionaries, on which we rely in the proof for k
sparse representation, is necessary to achieve a bound on $\lambda$. Lastly, we recall known results
from statistical learning theory that link covering numbers to generalization bounds.

We recall the definition of the covering numbers we wish to bound. Anthony and Bartlett (1999)
give a textbook introduction to covering numbers and their application to generalization bounds.

Definition 4 (Covering number). Let $(M, d)$ be a metric space and $S \subset M$. Then the
$\varepsilon$ covering number of $S$ defined as
$N(\varepsilon, S, d) = \min\left\{|A| : A \subset M \text{ and } S \subset \bigcup_{a \in A} B_d(a, \varepsilon)\right\}$
is the size of the minimal $\varepsilon$ cover of $S$ using $d$.

To prove Theorem 1 and Corollary 2 we first note that the space of all possible dictionaries is
a subset of a unit ball in a Banach space of dimension $np$ (with a norm specified below). Thus by
Proposition 5 formalized by Cucker and Smale (2002) the space of dictionaries has an $\varepsilon$
cover of size $(4/\varepsilon)^{np}$. We also note that a uniformly $L$ Lipschitz mapping between
metric spaces converts $\varepsilon/L$ covers into $\varepsilon$ covers. Then it is enough to show that
$\Psi_\lambda$ defined as $D \mapsto h_{R_\lambda,D}$ and $\Phi_k$ defined as $D \mapsto h_{H_k,D}$ are
uniformly Lipschitz (when $\Phi_k$ is restricted to the dictionaries with
$\mu_{k-1}(D) \le c < 1$). The proof of these Lipschitz properties is our next goal, in the form of
Lemmas 7 and 8.
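The covering argument outlined above can be written out as a short chain of inequalities. This is a sketch of the reasoning rather than a full proof; $\mathcal{D}$ denotes the set of admissible dictionaries and $\Psi$ a generic $L$-Lipschitz map into the sup-normed function space (both symbols are introduced here for illustration).

```latex
% Let {D_1, ..., D_N} be an (\varepsilon/L) cover of \mathcal{D} in the metric \|\cdot\|_{ME}.
\begin{align*}
\|D - D_j\|_{ME} \le \varepsilon/L
  &\;\Longrightarrow\;
  \|\Psi(D) - \Psi(D_j)\|_\infty \le L\,\|D - D_j\|_{ME} \le \varepsilon, \\
N\bigl(\varepsilon, \Psi(\mathcal{D}), \|\cdot\|_\infty\bigr)
  &\le N\bigl(\varepsilon/L, \mathcal{D}, \|\cdot\|_{ME}\bigr)
  \le \Bigl(\frac{4}{\varepsilon/L}\Bigr)^{np}
  = \Bigl(\frac{4L}{\varepsilon}\Bigr)^{np}.
\end{align*}
% Taking L = \lambda gives the bound of Theorem 1;
% taking L = k/(1-\delta) gives the bound of Corollary 2.
```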

The first step is to be clear about the metrics we consider over the spaces of dictionaries and of
error functions. We start by defining the following norm.

Definition 5. Let $D \in \mathbb{R}^{n \times p}$. We denote by $\|D\|_{ME} = \max_i \|d_i\|_2$ the
norm of its maximal column.

We will use the fact that $\|\cdot\|_{ME}$ upper bounds a certain induced norm.

Definition 6 (Induced matrix norm). Let $p, q \in \mathbb{N}$, then a matrix
$A \in \mathbb{R}^{n \times m}$ can be considered as an operator
$A : (\mathbb{R}^m, \|\cdot\|_p) \to (\mathbb{R}^n, \|\cdot\|_q)$. Then the $p, q$ induced norm is
defined as $\|A\|_{p,q} = \sup_{x \in \mathbb{R}^m,\ \|x\|_p = 1} \|Ax\|_q$.

Fact 1. $\|D\|_{1,2} \le \|D\|_{ME}$.

The geometric interpretation of this fact is that $Da / \|a\|_1$ is a convex combination of vectors
each of length at most $\|D\|_{ME}$, then $\|Da\|_2 \le \|D\|_{ME} \|a\|_1$.

The images of $\Psi_\lambda$ and $\Phi_k$ are sets of representation error functions: each dictionary
induces a set of precisely representable signals, and a representation error function is simply a map
of distances from this set. Representation error functions are clearly continuous, 1-Lipschitz, and
into $[0, 1]$. In this setting, a natural norm over the images is the supremum norm
$\|\cdot\|_\infty$.

Lemma 7. The function $\Psi_\lambda$ is $\lambda$-Lipschitz from
$(\mathbb{R}^{n \times p}, \|\cdot\|_{ME})$ to $C(S^{n-1})$.

Proof. Let $D$ and $D'$ be two normalized dictionaries whose corresponding elements are at most
$\varepsilon > 0$ far from one another. Let $x$ be a unit signal and $Da$ an optimal representation
for it. Then
$\|(D - D')a\|_2 \le \|D - D'\|_{1,2} \|a\|_1 \le \|D - D'\|_{ME} \|a\|_1 \le \varepsilon\lambda$.
Then $h_{R_\lambda,D'}(x) \le h_{R_\lambda,D}(x) + \varepsilon\lambda$ and by symmetry we have
$|\Psi_\lambda(D)(x) - \Psi_\lambda(D')(x)| \le \lambda\varepsilon$. This holds for all unit signals,
then $\|\Psi_\lambda(D) - \Psi_\lambda(D')\|_\infty \le \lambda\varepsilon$. □

We now provide a proof for Proposition 1, which is used in the corresponding treatment for
covering numbers under k sparsity.

Proof of Proposition 1. Assume that $\mu_{k-1}(D) \le \delta < 1$ and that
$1 \le \|d_i\|_2 \le \gamma$ for every column $d_i$. Let $D^k$ be a set of $k$ elements from $D$
achieving the minimum on $h_{H_k,D}(x)$, with $x \in S^{n-1}$. We now consider the Gram matrix
$G = (D^k)^\top D^k$. The matrix $G$ is symmetric, therefore it scales each point in the unit sphere
by a non-negative combination of its real eigenvalues. Also, the diagonal entries of $G$ are the
squared norms of the elements of $D^k$, therefore at least 1. By the Gersgorin theorem (Horn and
Johnson, 1990), the eigenvalues of the Gram matrix are lower bounded by $1 - \delta > 0$. Then in
particular $G$ has a symmetric inverse, which scales each point by no more than $1/(1 - \delta)$. Then
$\|G^{-1}\|_{1,1} \le 1/(1 - \delta)$.

In particular, elements of $D^k$ are linearly independent, which implies that the unique optimal
representation of $x$ as a linear combination of the columns of $D^k$ is $D^k a$ with
$a = \left((D^k)^\top D^k\right)^{-1} (D^k)^\top x$. By the definition of induced matrix norms, we have

$$\|a\|_1 \le \left\|\left((D^k)^\top D^k\right)^{-1}\right\|_{1,1} \left\|(D^k)^\top x\right\|_1 \le \gamma k/(1 - \delta),$$

the last bound because $x$ is a unit vector, and $D^k$ has $k$ columns whose norm is bounded
by $\gamma$. □

Lemma 8. The function $\Phi_k$ is a $k/(1 - \delta)$-Lipschitz mapping from the set of normalized
dictionaries with $\mu_{k-1}(D) \le \delta$, with the metric induced by $\|\cdot\|_{ME}$, to
$C(S^{n-1})$.

The proof of this lemma is the same as that of Lemma 7 except that $a$ is taken to be an
optimal representation that fulfills $\|a\|_1 \le \lambda = k/(1 - \mu_{k-1}(D))$, whose existence is
guaranteed by Proposition 1.

This concludes the proof of Theorem 1 and Corollary 2.

The next theorem shows that unfortunately, $\Phi_k$ is not uniformly $L$-Lipschitz for any constant
$L$, requiring its restriction to an appropriate subset of the dictionaries.

Theorem 10. For any $k, n, p$, there exist $c > 0$ and $q$, such that for every $\varepsilon > 0$,
there exist $D, D'$ such that $\|D - D'\|_{ME} \le \varepsilon$ but
$|h_{H_k,D}(q) - h_{H_k,D'}(q)| \ge c$.

Proof. First we show that there exists $c > 0$ such that every dictionary will have k sparse
representation error of at least $c$ on some signal. Let $\mathcal{U}_{S^{n-1}}$ be the uniform
probability measure on the sphere, and $A_c$ the probability assigned by it to the set within $c$ of a
$k$ dimensional subspace. As $c \searrow 0$, $A_c$ also tends to zero, then there exists $c > 0$ s.t.
$\binom{p}{k} A_c < 1$. Then for that $c$ there exists a set of positive measure on which
$h_{H_k,D} > c$, let $q$ be a point in this set.

To complete the proof we consider a dictionary $D$ whose first $k - 1$ elements are the standard
basis $\{e_1, \ldots, e_{k-1}\}$, its $k$th element is
$d_k = \sqrt{1 - \varepsilon^2/2}\, e_1 + \varepsilon e_k/2$, and the remaining elements are chosen
arbitrarily. Now construct $D'$ to be identical to $D$ except its $k$th element is
$v = \sqrt{1 - \varepsilon^2/2}\, e_1 + lq$, choosing $l$ so that $\|v\|_2 = 1$ (which implies that
$|l| \le \varepsilon/2$). Then
$\|D - D'\|_{ME} = \|\varepsilon e_k/2 - lq\|_2 \le \varepsilon$ and $h_{H_k,D'}(q) = 0$. □


To conclude the generalization bounds of Theorems 4, 5, 7, 8 and 9 from the covering number
bounds we have provided, we use the following two results. The first has a simple proof which we
therefore give here. The second result is an adaptation of results by Bartlett et al. (2005) to our
needs, and explained further in the appendix.

Lemma 9. Let $\mathcal{F}$ be a class of $[0, B]$ functions with covering number bound
$(C/\varepsilon)^d > e/B^2$ under the supremum norm. Then for every $x > 0$, with probability of at
least $1 - e^{-x}$ over the $m$ samples in $E_m$ chosen according to $\nu$, for all
$f \in \mathcal{F}$:

$$E f \le E_m f + B\left(\sqrt{\frac{d\ln(C\sqrt{m})}{2m}} + \sqrt{\frac{x}{2m}}\right) + \sqrt{\frac{4}{m}}.$$

Lemma 10. If $\mathcal{F}$ is a class of $[0, 1]$ functions with $C > 2$ and $d \in \mathbb{N}$ s.t.
$N(\varepsilon, \mathcal{F}, L_2(\nu)) \le (C/\varepsilon)^d$ for every probability measure $\nu$ and
$\varepsilon > 0$, then for all $K > 1$, $\alpha, x > 0$, $f \in \mathcal{F}$, with probability at
least $1 - e^{-x}$ over the $m$ samples used in $E_m$ and drawn from $\nu$:

$$E f \le \frac{K}{K-1} E_m f + 6K \max\left\{\frac{C^2\alpha}{2m},\ (480)^2 \frac{(d+1)\log\frac{m}{\alpha}}{m},\ \frac{20 + 22\log(m)}{m}\right\} + \frac{11x + 5K}{m}.$$

Our fast rates results are simple applications of this lemma, noting that an $\varepsilon$ cover in
$C(S^{n-1})$ is also an $\varepsilon$ cover under an $L_2$ metric induced by any measure.

Proof of Lemma 9. We wish to bound $\sup_{f \in \mathcal{F}} Ef - E_m f$. Take
$\mathcal{F}_\varepsilon$ to be a minimal $\varepsilon$ cover of $\mathcal{F}$, then for an arbitrary
$f$, denoting $f_\varepsilon$ an $\varepsilon$ close member of $\mathcal{F}_\varepsilon$,
$Ef - E_m f \le E f_\varepsilon - E_m f_\varepsilon + 2\varepsilon$. In particular,
$\sup_{f \in \mathcal{F}} Ef - E_m f \le 2\varepsilon + \sup_{f \in \mathcal{F}_\varepsilon} Ef - E_m f$.
To bound the supremum on the now finite class of functions, note that $Ef - E_m f$ is a function of
$m$ independent variables (the samples chosen according to $\nu$), which changes by at most $B/m$ when
one of the variables is modified. Then by the bounded differences inequality,
$P(Ef - E_m f - E(Ef - E_m f) > t) = P(Ef - E_m f > t) \le \exp(-2mB^{-2}t^2)$.

The probability that any of the $|\mathcal{F}_\varepsilon|$ differences under the supremum is larger
than $t$ may be union bounded as
$|\mathcal{F}_\varepsilon| \cdot \exp(-2mB^{-2}t^2) \le \exp\left(d\log(C/\varepsilon) - 2mB^{-2}t^2\right)$.

In order to control the probability with $x$ as in the statement of the lemma, we need to have
$-x = d\log(C/\varepsilon) - 2mB^{-2}t^2$ and thus we choose
$t = \sqrt{B^2/2m}\sqrt{d\log(C/\varepsilon) + x}$. Then with high probability we bound the supremum of
differences by $t$ which is upper bounded, using the assumption on the covering number bound, by
$B\left(\sqrt{d\log(C/\varepsilon)/2m} + \sqrt{x/2m}\right)$.

Then the proof is completed by substitution into the bound over the whole function class
$\mathcal{F}$ and taking $\varepsilon = 1/\sqrt{m}$. □



4 On the Babel function 

The Babel function is one of several metrics defined in the sparse representations literature to
quantify an "almost orthogonality" property that dictionaries may enjoy. Such properties have been
shown to imply theoretical properties such as uniqueness of the optimal k sparse representation.
In the algorithmic context, Donoho and Elad (2003) and Tropp (2004) use the Babel function to
show that particular tractable algorithms for finding sparse representations are indeed
approximation algorithms when applied to such dictionaries. This reinforces the practical importance
of the learnability of this class of dictionaries. We proceed to discuss some elementary properties of
the Babel function, and then state a bound on the proportion of dictionaries having sufficiently good
Babel function.

Measures of orthogonality are typically defined in terms of inner products between the elements 
of the dictionary. Perhaps the simplest of these measures of orthogonality is the following special 
case of the Babel function. 

Definition 11. The coherence of a dictionary $D$ is $\mu_1(D) = \max_{i \ne j} |\langle d_i, d_j \rangle|$.

The Babel function, in considering sums of $k$ inner products at a time, rather than the maximum over all inner products, is better adapted to quantify the effects of non orthogonality on representing a signal with a particular level $k+1$ of sparsity. The additional expressive power of $\mu_k$ over $\mu_1$ is illustrated by considering that ensuring $\mu_k < 1$ by restricting $\mu_1$ implies the constraint $\mu_1(D) < 1/k$, which for $k > 1$ would exclude a dictionary in which all pairs of elements have zero inner product except for some disjoint pairs whose inner product equals one half, despite such a dictionary having $\mu_k = 1/2$ for any $k$.

To better understand $\mu_k(D)$, we consider first its extreme values. When $\mu_k(D) = 0$, for any $k \ge 1$, this means that $D$ is an orthogonal set (therefore $p \le n$). The maximal value of $\mu_k(D)$ is $k$, and occurs only if some dictionary element is repeated (up to sign) at least $k+1$ times.
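
The examples above can be checked directly. The sketch below assumes the standard definition of the Babel function, $\mu_k(D) = \max_i \max_{|\Lambda| = k,\, i \notin \Lambda} \sum_{\lambda \in \Lambda} |\langle d_\lambda, d_i \rangle|$; the helper name `babel` and the three small dictionaries are illustrative, not taken from the paper.

```python
import math

def babel(D, k):
    """Babel function mu_k of a dictionary given as a list of unit vectors."""
    p = len(D)
    best = 0.0
    for i in range(p):
        # Absolute inner products of d_i with every other element.
        inner = [abs(sum(a * b for a, b in zip(D[i], D[j])))
                 for j in range(p) if j != i]
        # The maximising index set of size k consists of the k largest entries.
        best = max(best, sum(sorted(inner, reverse=True)[:k]))
    return best

h = math.sqrt(3.0) / 2.0
# Orthonormal basis: the Babel function vanishes.
orthonormal = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Two disjoint pairs with inner product 1/2, all other inner products zero.
paired = [[1, 0, 0, 0], [0.5, h, 0, 0], [0, 0, 1, 0], [0, 0, 0.5, h]]
# One element repeated (up to sign) three times: mu_2 attains its maximum 2.
repeated = [[1, 0], [1, 0], [-1, 0]]

mu_values = {
    "orthonormal": babel(orthonormal, 2),
    "paired_k1": babel(paired, 1),
    "paired_k3": babel(paired, 3),
    "repeated_k2": babel(repeated, 2),
}
```

For the `paired` dictionary the coherence and every $\mu_k$ equal $1/2$, so it satisfies $\mu_k < 1$ for all $k$ even though it violates the cruder requirement $\mu_1 < 1/k$ once $k \ge 2$.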

A well known generic class of dictionaries with more elements than a basis is that of frames (see Duffin and Schaeffer, 1952), which include many wavelet systems and filter banks. Some frames can be trivially seen to fulfill our condition on the Babel function.

Proposition 3. Let $D \in \mathbb{R}^{n \times p}$ be a frame of $\mathbb{R}^n$, so that for every $v \in S^{n-1}$ we have that $A \le \sum_{i=1}^{p} |\langle v, d_i \rangle|^2 \le B$, with $\|d_i\|_2 = 1$ for all $i$, and $B < 1 + 1/(p-1)$. Then $\mu_{p-1} < 1$.

This may be easily verified using the relation between $\|\cdot\|_1$ and $\|\cdot\|_2$ in $\mathbb{R}^{p-1}$.
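
One way to carry out this verification is sketched below, under the assumption that the intended conclusion is a bound on $\mu_{p-1}$, the sum of all off-diagonal absolute inner products with a fixed element. Take $v = d_i$ in the upper frame bound and use $\|u\|_1 \le \sqrt{p-1}\,\|u\|_2$ for $u \in \mathbb{R}^{p-1}$:

```latex
\begin{align*}
\sum_{j \neq i} |\langle d_i, d_j \rangle|^2
  &= \sum_{j=1}^{p} |\langle d_i, d_j \rangle|^2 - 1
   \le B - 1 < \frac{1}{p-1}, \\
\sum_{j \neq i} |\langle d_i, d_j \rangle|
  &\le \sqrt{p-1}\,\Bigl(\sum_{j \neq i} |\langle d_i, d_j \rangle|^2\Bigr)^{1/2}
   < \sqrt{p-1} \cdot \frac{1}{\sqrt{p-1}} = 1 .
\end{align*}
```

Since $\mu_k$ is non-decreasing in $k$, the same bound then holds for every $\mu_k$ with $k \le p-1$.
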
4.1 Proportion of dictionaries with $\mu_{k-1}(D) < \delta$

We return to the question of the prevalence of dictionaries with $\mu_{k-1}(D) < \delta$. Are almost all dictionaries of this kind? If the answer is affirmative, it implies that Theorem 8 is quite strong, and representation finding algorithms such as basis pursuit are almost always exact, which might help prove properties of dictionary learning algorithms. If the opposite is true and few dictionaries satisfy this condition, the results of this paper are weak. While there might be better measures on the space of dictionaries, we consider one that seems natural: suppose that a dictionary $D$ is constructed by choosing $p$ unit vectors uniformly from $S^{n-1}$; what is the probability that $\mu_{k-1}(D) < \delta$?

Theorem 2 gives us the following answer to this question. Under the assumption that the sparsity parameter $k$ grows slowly, if at all, as $n \to \infty$ (specifically, that $k \log p = o(\sqrt{n})$), this theorem implies that asymptotically almost all dictionaries under the Lebesgue measure are learnable.
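
The question can also be explored empirically. The following is a minimal Monte Carlo sketch, not the paper's method: it draws dictionaries with columns uniform on $S^{n-1}$ (by normalizing Gaussian vectors) and reports the fraction with $\mu_{k-1}(D) < \delta$. The helper names and the parameter values are illustrative assumptions.

```python
import math
import random

def random_unit_vector(n, rng):
    # A normalized standard Gaussian vector is uniform on the sphere S^{n-1}.
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def babel(D, k):
    # mu_k: largest sum of k absolute inner products with a fixed element.
    best = 0.0
    for i in range(len(D)):
        inner = [abs(sum(a * b for a, b in zip(D[i], D[j])))
                 for j in range(len(D)) if j != i]
        best = max(best, sum(sorted(inner, reverse=True)[:k]))
    return best

def fraction_below(n, p, k, delta, trials, seed=0):
    """Fraction of random dictionaries with mu_{k-1}(D) < delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        D = [random_unit_vector(n, rng) for _ in range(p)]
        if babel(D, k - 1) < delta:
            hits += 1
    return hits / trials

# Illustrative setting: dimension 100, 15 elements, sparsity 3.
estimate = fraction_below(n=100, p=15, k=3, delta=0.6, trials=10)
```

With a fixed seed the same dictionaries are drawn for every `delta`, so the estimated fraction is non-decreasing in `delta`.
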
The remainder of this section is devoted to the proof of Theorem 2. This proof relies heavily on the Orlicz norms for random variables and their properties; Van der Vaart and Wellner (1996) give a detailed introduction. We recall a few of the definitions and facts presented there.

Definition 12. Let $\psi$ be a non-decreasing, convex function with $\psi(0) = 0$, and let $X$ be a random variable. Then

$$\|X\|_\psi = \inf\left\{ C > 0 : E\,\psi\left(\frac{|X|}{C}\right) \le 1 \right\}$$

is called an Orlicz norm.

As may be verified, these are indeed norms for appropriate $\psi$, such as $\psi_2 = e^{x^2} - 1$, which is the case that will interest us most.

By the Markov inequality we can obtain that variables with finite Orlicz norms have light tails.

Fact 2. We have $P(|X| > x) \le \left(\psi_2\left(x / \|X\|_{\psi_2}\right)\right)^{-1}$.

The next fact is an almost converse to the last fact, stating that light tailed random variables have finite $\psi_2$ Orlicz norms.

Fact 3. Let $A, B > 0$ and $P(|X| \ge x) \le A e^{-Bx^p}$ for all $x$, where $p \ge 1$, then $\|X\|_{\psi_p} \le ((1+A)/B)^{1/p}$.

The following fact bounds the maximum of variables with light tails.

Fact 4. We have $\|\max_{1 \le i \le m} X_i\|_{\psi_2} \le K \sqrt{\log m}\, \max_i \|X_i\|_{\psi_2}$.

The constant $K$ may be upper bounded by $\sqrt{2}$. Note that the independence of $X_i$ is not required. We use also one isoperimetric fact about the sphere in high dimension.

Definition 13. The $\epsilon$ expansion of a set $D$ in a metric space $(X, d)$ is defined as

$$D_\epsilon = \{x \in X \mid d(x, D) \le \epsilon\},$$

where $d(x, A) = \inf_{a \in A} d(x, a)$.



Fact 5 (Lévy's isoperimetric inequality, 1952). Let $C$ be one half of $S^{n-1}$, then $\mu(S^{n-1} \setminus C_\epsilon) \le \sqrt{\pi/8}\, \exp\left(-\frac{(n-2)\epsilon^2}{2}\right)$.

Our goal in the remainder of this subsection is to obtain the following bound.

Lemma 14. Let $D$ be a dictionary chosen at random as described above, then

$$\|\mu_k(D)\|_{\psi_2} \le \frac{5 k \log p}{\sqrt{n-2}}.$$

Our probabilistic bound on the Babel function is a direct conclusion of Fact 3 and Lemma 14, which we now proceed to prove. The plan of our proof is to bound the $\psi_2$ norm of $\mu_k$ from the inside terms and outward, using Fact 4 to overcome the maxima over possibly dependent random variables.

Lemma 15. Let $X_1, X_2$ be unit vectors chosen uniformly and independently from $S^{n-1}$, then

$$\|\,|\langle X_1, X_2 \rangle|\,\|_{\psi_2} \le \sqrt{6/(n-2)}.$$

We denote the bound on the right hand side $W$.

Proof. Taking $X$ to be uniformly chosen from $S^{n-1}$, for any constant unit vector $x_0$ we have that $\langle X, x_0 \rangle$ is a light tailed random variable by Fact 5. By Fact 3, we may bound $\|\langle X, x_0 \rangle\|_{\psi_2}$. Replacing $x_0$ by a random unit vector is equivalent to applying to $X$ a uniformly chosen rotation, which does not change the analysis. □
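
As a numerical sanity check of this lemma, the sketch below estimates the $\psi_2$ Orlicz norm of $|\langle X_1, X_2 \rangle|$ from samples and compares it with $W = \sqrt{6/(n-2)}$. It is an illustration only: the empirical norm is found by bisection on the defining condition $E\,\psi_2(|X|/C) \le 1$ with the expectation replaced by a sample mean, and the helper names, sample size and dimension are assumptions.

```python
import math
import random

def random_unit_vector(n, rng):
    # A normalized standard Gaussian vector is uniform on the sphere S^{n-1}.
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def psi2_norm(samples, iters=60):
    """Smallest C with mean(exp((x/C)^2) - 1) <= 1, found by bisection."""
    biggest = max(abs(s) for s in samples)
    lo = biggest / 26.0                       # constraint is violated here, exp stays finite
    hi = biggest / math.sqrt(math.log(2.0))   # every term is at most 2, so the constraint holds
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(math.exp((s / mid) ** 2) for s in samples) / len(samples)
        if mean - 1.0 <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

rng = random.Random(0)
n = 50
samples = []
for _ in range(2000):
    x1 = random_unit_vector(n, rng)
    x2 = random_unit_vector(n, rng)
    samples.append(abs(sum(a * b for a, b in zip(x1, x2))))

estimate = psi2_norm(samples)
W = math.sqrt(6.0 / (n - 2))
```

The estimate should fall below `W`, consistent with the lemma giving an upper bound rather than an exact value.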

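The next step is to bound the inner maximum appearing in the definition of $\mu_k$.

Lemma 16. Let $\{d_i\}_{i=1}^{p}$ be uniformly and independently chosen unit vectors, then

$$\left\| \max_{\Lambda \subset \{2,\dots,p\} \wedge |\Lambda| = k} \sum_{\lambda \in \Lambda} |\langle d_1, d_\lambda \rangle| \right\|_{\psi_2} \le k K W \sqrt{\log(p-1)}.$$

Proof. Take $X_\lambda$ to be $\langle d_1, d_\lambda \rangle$. Then using Fact 4 and the previous lemma we find

$$\left\| \max_{2 \le \lambda \le p} |X_\lambda| \right\|_{\psi_2} \le K \sqrt{\log(p-1)}\, \max_\lambda \|\,|X_\lambda|\,\|_{\psi_2} \le K W \sqrt{\log(p-1)}.$$

Define the random permutation $\lambda_j$ s.t. $|X_{\lambda_j}|$ are non-increasing. In this notation, it is clear that

$$\max_{\Lambda \subset \{2,\dots,p\} \wedge |\Lambda| = k} \sum_{\lambda \in \Lambda} |X_\lambda| = \sum_{j=1}^{k} |X_{\lambda_j}|.$$

Note that $|X_{\lambda_i}| \le |X_{\lambda_1}|$, then for every $i$,

$$\|\,|X_{\lambda_i}|\,\|_{\psi_2} \le \|\,|X_{\lambda_1}|\,\|_{\psi_2} \le K W \sqrt{\log(p-1)}.$$

By the triangle inequality,

$$\left\| \sum_{j=1}^{k} |X_{\lambda_j}| \right\|_{\psi_2} \le \sum_{j=1}^{k} \|\,|X_{\lambda_j}|\,\|_{\psi_2} \le k K W \sqrt{\log(p-1)}.$$

□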



Remark 1. Two facts are relevant to the tightness of the approximations in the last proof. First, that $|X_{\lambda_i}|$ are variables with positive expectation bounded away from zero, thus the norm of their sum must scale at least linearly in the number of summands, so the triangle inequality is essentially tight. Second, we consider the bound $\|\,|X_{\lambda_i}|\,\|_{\psi_2} \le \|\,|X_{\lambda_1}|\,\|_{\psi_2}$, and note its looseness is strictly limited by the slow growth of the $\sqrt{\log(\cdot)}$ factor, and in any case is bounded by 2.

To complete the proof of Lemma 14 we replace $d_1$ with the dictionary element maximizing the Orlicz norm, by another application of Fact 4, and to complete the proof of Theorem 2 apply Fact 3 to the estimated Orlicz norm.



5 Dictionary learning in feature spaces 

We propose in Section 2 a scenario in which dictionary learning is performed in a feature space corresponding to a kernel function. Here we show how to adapt the different generalization bounds discussed in this paper for the particular case of $\mathbb{R}^n$ to more general feature spaces, and the dependence of the sample complexities on the properties of the kernel function or the corresponding feature mapping. We begin with the relevant specialization of the results of Maurer and Pontil (2010), which have the simplest dependence on the kernel, and then discuss the extensions to $k$ sparse representation and to the cover number techniques presented in the current work.

Theorem 3 applies as is to the feature space, under the simple assumption that the dictionary elements and signals are in its unit ball, which is guaranteed by some kernels such as the Gaussian kernel. Then we take $\nu$ on the unit ball of $\mathcal{H}$ to be induced by some distribution $\nu'$ on the domain of the kernel, and the theorem applies to any such $\nu'$ on $\mathcal{R}$. Nothing more is required if the representation is chosen from $R_\lambda$. The corresponding generalization bound for $k$ sparse representations when the dictionary elements are near orthogonal in the feature space is given in Proposition 2.

Proof of Proposition 2. Proposition 1 applies with the Euclidean norm of $\mathcal{H}$, and $\gamma = 1$. We apply Theorem 3 with $\lambda = k/(1-\delta)$. □

The results so far show that generalization in dictionary learning can occur despite the potentially infinite dimension of the feature space, without considering practical issues of representation and computation. We now make the domain and applications of the kernel explicit in order to address a basic computational question, and allow the use of cover number based generalization bounds to prove Theorem 9. We now consider signals represented in a metric space $(\mathcal{R}, d)$, in which similarity is measured by the kernel $\kappa$ corresponding to the feature map $\phi : \mathcal{R} \to \mathcal{H}$. The elements of a dictionary $D$ are now from $\mathcal{R}$, and we denote $\Phi D$ their mapping by $\phi$ to $\mathcal{H}$. The representation error function used is then $h_{\phi,A,D}$.

We now show that the approximation error in feature space is a quadratic function of the coeffi- 
cient vector, which may be found by applications of the kernel. 

Proposition 4. Computing the representation error at a given $x, a, D$ requires $O(p^2)$ kernel applications in general, and only $O(k^2 + p)$ when $a$ is $k$ sparse.

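Proof. Writing the error:

$$\begin{aligned}
\|(\Phi D)a - \phi(x)\|^2 &= \langle (\Phi D)a - \phi(x),\ (\Phi D)a - \phi(x) \rangle \\
&= \langle (\Phi D)a, (\Phi D)a \rangle + \langle \phi(x), \phi(x) \rangle - 2\langle \phi(x), (\Phi D)a \rangle \\
&= \left\langle \sum_{i=1}^{p} \phi(d_i) a_i,\ \sum_{j=1}^{p} \phi(d_j) a_j \right\rangle + \langle \phi(x), \phi(x) \rangle - 2\left\langle \phi(x),\ \sum_{i=1}^{p} \phi(d_i) a_i \right\rangle \\
&= \sum_{i=1}^{p} a_i \sum_{j=1}^{p} a_j \langle \phi(d_i), \phi(d_j) \rangle + \langle \phi(x), \phi(x) \rangle - 2\sum_{i=1}^{p} a_i \langle \phi(x), \phi(d_i) \rangle \\
&= \sum_{i=1}^{p} a_i \sum_{j=1}^{p} a_j \kappa(d_i, d_j) + \kappa(x, x) - 2\sum_{i=1}^{p} a_i \kappa(x, d_i).
\end{aligned}$$

□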
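
The last line of this derivation translates directly into code. The following is a minimal sketch, assuming signals and dictionary elements are plain vectors and using a Gaussian kernel as one example; the function names and the bandwidth parameter `sigma` are illustrative and not part of the paper.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def linear_kernel(x, y):
    # Feature map is the identity, so the error reduces to the Euclidean one.
    return sum(a * b for a, b in zip(x, y))

def feature_space_error(x, a, D, kernel):
    """Squared error ||(Phi D) a - phi(x)||^2 computed through kernel calls only."""
    # Only indices with non-zero coefficients contribute, so a k sparse
    # coefficient vector needs k^2 + k + 1 kernel evaluations here.
    support = [i for i, coeff in enumerate(a) if coeff != 0.0]
    quad = sum(a[i] * a[j] * kernel(D[i], D[j]) for i in support for j in support)
    cross = sum(a[i] * kernel(x, D[i]) for i in support)
    return quad + kernel(x, x) - 2.0 * cross

D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # dictionary elements in R
a = [2.0, 0.0, -1.0]                        # a 2 sparse coefficient vector
x = [0.5, 3.0]

euclidean_error = feature_space_error(x, a, D, linear_kernel)
gaussian_error = feature_space_error(x, a, D, gaussian_kernel)
```

With the linear kernel the result coincides with the ordinary squared Euclidean error $\|Da - x\|^2$, which gives a simple way to check the formula.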

We note that the $k$ sparsity constraint on $a$ poses algorithmic difficulties beyond those addressed here. Some of the common approaches to these, such as Orthogonal Matching Pursuit (Chen et al., 1989), also depend on the data only through their inner products, and may therefore be adapted to the kernel setting.

The cover number bounds depend strongly on the dimension of the space of dictionary elements. Taking $\mathcal{H}$ as the space of dictionary elements is the simplest approach, but may lead to vacuous or weak bounds, for example in the case of the Gaussian kernel whose feature space is infinite dimensional. Instead we propose to use the space of data representations $\mathcal{R}$, whose dimensions are generally bounded by practical considerations. In addition, we will assume that the kernel is not "too wild" in the following sense.

Definition 17. Let $L, \alpha > 0$, and let $(A, d')$ and $(B, d)$ be metric spaces. We say a mapping $f : A \to B$ is uniformly $L$ Hölder of order $\alpha$ on a set $S \subset A$ if $\forall x, y \in S$, the following bound holds:

$$d(f(x), f(y)) \le L \cdot d'(x, y)^\alpha.$$

The relevance of this smoothness condition is as follows: 

Fact 6. A Hölder function maps an $\epsilon$ cover of $S$ to an $L\epsilon^\alpha$ cover of its image $f(S)$. Thus, to obtain an $\epsilon$ cover of the image of $S$, it is enough to begin with an $(\epsilon/L)^{1/\alpha}$ cover of $S$.

A Hölder feature map $\phi$ allows us to bound the cover numbers of the dictionary elements in $\mathcal{H}$ using their cover number bounds in $\mathcal{R}$. Note that not every kernel corresponds to a Hölder feature map (the Dirac $\delta$ kernel is a counter example: any two distinct elements are mapped to elements at a mutual distance of 1), and not for every kernel the feature map is known. The following lemma bounds the geometry of the feature map using that of the kernel.






Lemma 18. Let $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$, and assume further that $\kappa$ fulfills a Hölder condition of order $\alpha$ uniformly in each parameter, that is, $|\kappa(x, y) - \kappa(x + h, y)| \le L \|h\|^\alpha$. Then $\phi$ uniformly fulfills a Hölder condition of order $\alpha/2$ with constant $\sqrt{2L}$.

This result is not sharp. For example, for the Gaussian case, both the kernel and the feature map are Hölder of order 1.

Proof. Using the Hölder condition, we have that $\|\phi(x) - \phi(y)\|_{\mathcal{H}}^2 = \kappa(x, x) - \kappa(x, y) + \kappa(y, y) - \kappa(x, y) \le 2L \|x - y\|^\alpha$. All that remains is to take the square root of both sides. □
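
The lemma can be illustrated numerically for a one dimensional Gaussian kernel. In the sketch below, the kernel $\kappa(x,y) = e^{-(x-y)^2/2}$ is Lipschitz in each argument (order $\alpha = 1$) with constant $L = e^{-1/2}$, the maximum of $|t e^{-t^2/2}|$; this specific kernel, the grid of distances and the names are assumptions made for the example.

```python
import math

def gaussian_kernel(x, y):
    return math.exp(-0.5 * (x - y) ** 2)

def feature_distance(x, y, kernel):
    # ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y), evaluated via the kernel.
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))

L = math.exp(-0.5)   # Lipschitz constant of t -> exp(-t^2 / 2), attained at t = 1
alpha = 1.0

# Ratio of the actual feature space distance to the bound sqrt(2L) * |h|^(alpha/2).
distances = [t / 100.0 for t in range(1, 501)]
worst_ratio = max(
    feature_distance(0.0, t, gaussian_kernel) / (math.sqrt(2.0 * L) * t ** (alpha / 2.0))
    for t in distances
)
```

The ratio stays below 1 on the whole grid and tends to 0 for small distances, in line with the remark that the lemma is not sharp for the Gaussian kernel.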

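For a given feature mapping $\phi$ and set of representations $\mathcal{R}$, we define two families of function classes so:

$$\mathcal{W}_{\phi,\lambda} = \{h_{\phi,R_\lambda,D} : D \in \mathcal{D}^p\} \quad \text{and} \quad \mathcal{Q}_{\phi,k,\delta} = \{h_{\phi,H_k,D} : D \in \mathcal{D}^p \wedge \mu_{k-1}(\Phi D) < \delta\}.$$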

The next proposition completes this section by giving the cover number bounds for the representation error function classes induced by appropriate kernels, from which various generalization bounds easily follow, such as Theorem 9.

Proposition 5. Let $\mathcal{R}$ be a set of representations with a cover number bound of $(C/\epsilon)^n$, and let either $\phi$ be uniformly $L$ Hölder of order $\alpha$ on $\mathcal{R}$, or $\kappa$ be uniformly $L$ Hölder of order $2\alpha$ on $\mathcal{R}$ in each parameter, and let $\gamma = \sup_{d \in \mathcal{R}} \|\phi(d)\|_{\mathcal{H}}$. Then the function classes $\mathcal{W}_{\phi,\lambda}$ and $\mathcal{Q}_{\phi,k,\delta}$, taken as metric spaces with the supremum norm, have $\epsilon$ covers of cardinalities at most $\left(C (\lambda\gamma L/\epsilon)^{1/\alpha}\right)^{np}$ and $\left(C \left(k\gamma^2 L/(\epsilon(1-\delta))\right)^{1/\alpha}\right)^{np}$, respectively.

Proof. We first consider the simpler case of $l_1$ constrained coefficients. If $\|a\|_1 \le \lambda$ and also $\max_{d \in \mathcal{D}} \|\phi(d)\|_{\mathcal{H}} \le \gamma$, then by the considerations applied in Section 3, to obtain an $\epsilon$ cover of the set $\{\min_a \|(\Phi D)a - \phi(x)\|_{\mathcal{H}} : D \in \mathcal{D}\}$, it is enough to obtain an $\epsilon/(\lambda\gamma)$ cover of $\{\Phi D : D \in \mathcal{D}\}$. If also $\phi$ is uniformly $L$ Hölder of order $\alpha$ over $\mathcal{R}$, then a $(\lambda\gamma L/\epsilon)^{-1/\alpha}$ cover of the set of dictionaries is sufficient, which as we have seen requires at most $\left(C (\lambda\gamma L/\epsilon)^{1/\alpha}\right)^{np}$ elements.

In the case of $l_0$ constrained representation, the bound on $\lambda$ due to Proposition 1 is $\gamma k/(1-\delta)$, and the result follows from the above by substitution. □

6 Conclusions 

Our work has several implications on the design of dictionary learning algorithms as used in signal, image, and natural language processing. First, the fact that generalization is only logarithmically dependent on the $l_1$ norm of the coefficient vector widens the set of applicable approaches to penalization. Second, in the particular case of $k$ sparse representation, we have shown that the Babel function is a key property for the generalization of dictionaries. It might thus be useful to modify dictionary learning algorithms so that they obtain dictionaries with low Babel functions, possibly through regularization or through certain convex relaxations. Third, mistake bounds (e.g., Mairal et al., 2010) on the quality of the solution to the coefficient finding optimization problem may lead to generalization bounds for practical algorithms, by tying such algorithms to $k$ sparse representation.

The upper bounds presented here invite complementary lower bounds. The existing lower 
bounds for k = 1 (vector quantization) and for k = p (representation using PCA directions) are 
applicable, but do not capture the geometry of general k sparse representation, and in particular 
do not clarify the effective dimension of the unrestricted class of dictionaries for it. We have not 
excluded the possibility that the class of unrestricted dictionaries has the same dimension as that 
of those with small Babel function. The best upper bound we know for the larger class, being the 
trivial one of order $O\left(\binom{p}{k} n^2 / m\right)$, leaves a significant gap for future exploration.

We mention also that the dependence on $\mu_{k-1}$ can also be viewed from an "algorithmic luckiness" perspective (Herbrich and Williamson, 2003): if the dictionary has favorable geometry in the sense that the Babel function is small, the generalization bounds are encouraging.

Appendix A: generalization with fast rates

In this appendix we justify some adaptations in Lemma 10 relative to its origins in Bartlett et al. (2005). Specifically, we assume only growth rates instead of combinatorial dimensions and give explicit constants for this case (no particular effort was made to make the constants tight).

Following are some concepts and general results needed to prove Lemma 10 beyond those introduced in the main body of the paper.

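Definition 19. Let $\mathcal{F}$ be a subset of a vector space $X$, $x \in X$. The star shaped closure of $\mathcal{F}$ around $x$ is

$$\star(\mathcal{F}, x) = \{\lambda f + (1 - \lambda)x : f \in \mathcal{F} \wedge \lambda \in [0, 1]\}.$$

Definition 20. A function $f : \mathbb{R}_+ \to \mathbb{R}_+$ is called sub-root if it is non negative, non decreasing and if $r \mapsto f(r)/\sqrt{r}$ is non increasing for $r > 0$.

Definition 21. Let $\{Z_i\}_{i=1}^{m} \cup \{\epsilon_i\}_{i=1}^{m}$ be independent variables, where $\epsilon_i$ are uniform over $\{-1, 1\}$, and $Z_i$ are i.i.d. The empirical Rademacher average of $\mathcal{F}$ is

$$R_m(\mathcal{F}) = E\left[\, \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \epsilon_i f(Z_i) \,\Big|\, Z_1, \dots, Z_m \right].$$

Lemma 22. Let $R_m(\mathcal{F})$ be the empirical Rademacher averages of $\mathcal{F}$ on a sample $Z_1, \dots, Z_m$. We have

$$R_m(\mathcal{F}) \le 12 \int_0^\infty \sqrt{\frac{\log N(\epsilon, \mathcal{F}, L_2(\nu_m))}{m}}\, d\epsilon,$$

where $\nu_m = \frac{1}{m} \sum_{i=1}^{m} \delta_{Z_i}$ is the empirical measure of the sample.

See Kakade and Tewari (2008) for a proof.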

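Lemma 23. For any $\gamma \ge e^{1/2}$ and $x \in [0, 1]$, we have $\int_0^x \sqrt{\log(\gamma/\epsilon)}\, d\epsilon \le 2x\sqrt{\log(\gamma/x)}$.

Proof. The indefinite integral $\int \sqrt{\log(\gamma/\epsilon)}\, d\epsilon$ is $\epsilon\sqrt{\log(\gamma/\epsilon)} - \frac{\sqrt{\pi}}{2}\gamma \cdot \mathrm{erf}\left(\sqrt{\log(\gamma/\epsilon)}\right)$, where $\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-u^2} du$. Then

$$\begin{aligned}
\int_0^x \sqrt{\log\frac{\gamma}{\epsilon}}\, d\epsilon
&= x\sqrt{\log\frac{\gamma}{x}} - \frac{\sqrt{\pi}}{2}\gamma\, \mathrm{erf}\left(\sqrt{\log\frac{\gamma}{x}}\right) - \lim_{\epsilon \to 0}\left(\epsilon\sqrt{\log\frac{\gamma}{\epsilon}} - \frac{\sqrt{\pi}}{2}\gamma\, \mathrm{erf}\left(\sqrt{\log\frac{\gamma}{\epsilon}}\right)\right) \\
&= x\sqrt{\log\frac{\gamma}{x}} - \frac{\sqrt{\pi}}{2}\gamma \left(\mathrm{erf}\left(\sqrt{\log\frac{\gamma}{x}}\right) - \mathrm{erf}(\infty)\right).
\end{aligned}$$

The error function erf is related to the tail probability of a normal variable, also known as the $Q$ function. In particular, $Q(x) = \frac{1}{2}\left(1 - \mathrm{erf}(x/\sqrt{2})\right)$, so that $2Q(\sqrt{2}x) = 1 - \mathrm{erf}(x)$. We thus substitute and then use the bound $Q(x) \le e^{-x^2/2}/(x\sqrt{2\pi})$:

$$\begin{aligned}
x\sqrt{\log\frac{\gamma}{x}} - \frac{\sqrt{\pi}}{2}\gamma \left(\mathrm{erf}\left(\sqrt{\log\frac{\gamma}{x}}\right) - 1\right)
&= x\sqrt{\log\frac{\gamma}{x}} - \frac{\sqrt{\pi}}{2}\gamma \left(-2Q\left(\sqrt{2\log\frac{\gamma}{x}}\right)\right) \\
&\le x\sqrt{\log\frac{\gamma}{x}} + \frac{x}{2\sqrt{\log\frac{\gamma}{x}}} \qquad \text{(Bound on } Q\text{)}.
\end{aligned}$$

By our assumptions, $\gamma/x \ge \gamma \ge e^{1/2}$, hence $\sqrt{\log(\gamma/x)} \ge \sqrt{1/2}$ and $\frac{1}{2\sqrt{\log(\gamma/x)}} \le \sqrt{1/2} \le \sqrt{\log(\gamma/x)}$, then

$$x\left(\sqrt{\log\frac{\gamma}{x}} + \frac{1}{2\sqrt{\log\frac{\gamma}{x}}}\right) \le 2x\sqrt{\log\frac{\gamma}{x}},$$

completing the proof. □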
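
The inequality of Lemma 23 can be checked numerically from the closed form of the integral. The sketch below assumes the closed form $x\sqrt{\log(\gamma/x)} + \frac{\sqrt{\pi}}{2}\gamma\,\mathrm{erfc}(\sqrt{\log(\gamma/x)})$, which follows from the antiderivative used in the proof; the function names and the grid of test values are illustrative.

```python
import math

def entropy_integral(x, gamma):
    """Integral of sqrt(log(gamma / eps)) for eps from 0 to x, in closed form."""
    root = math.sqrt(math.log(gamma / x))
    # erfc(t) = 1 - erf(t); using it directly avoids cancellation for large t.
    return x * root + 0.5 * math.sqrt(math.pi) * gamma * math.erfc(root)

def lemma_bound(x, gamma):
    """Right hand side of Lemma 23."""
    return 2.0 * x * math.sqrt(math.log(gamma / x))

# Largest ratio integral / bound over a grid with gamma >= e^(1/2) and x in (0, 1].
worst_ratio = max(
    entropy_integral(k / 100.0, gamma) / lemma_bound(k / 100.0, gamma)
    for gamma in (math.exp(0.5), 2.0, 10.0)
    for k in range(1, 101)
)
```

On this grid the ratio stays below 1, with the largest values at $\gamma = e^{1/2}$ and $x = 1$, where the assumption on $\gamma$ is tight.
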
We return to prove Lemma 10.

Proof. The core of the proof is to define a particular sub-root function, and show its fixed point decays as $1/m$. We then apply Theorem 3.3 of Bartlett, Bousquet and Mendelson (Bartlett et al., 2005) to this sub-root function to complete the proof.

We define the function

$$\psi(r) = 10\, E R_m\{f \in \star(\mathcal{F}, 0) \mid Ef^2 \le r\} + \frac{11 \log m}{m}.$$

By Lemma 3.4 of Bartlett et al. (2005) (with $Tf = Ef^2$ and $\hat{f} = 0$) and Lemma 3.2 of Bartlett et al. (2005), this function is sub-root and thus has a unique fixed point, which we denote $r^*$, and $r \le r^* \iff r \le \psi(r)$.

To upper bound $r^*$ we first construct an upper bound on $\psi$, in which $Ef^2$ is replaced by $E_m f^2$, valid for $r \ge r^*$. The expectation of this upper bound is controlled using an entropy integral. We make two observations.

1. By Corollary 2.2 of Bartlett et al. (2005), with $b = 1$, for $r \ge \psi(r)$ with probability at least $1 - 1/m$,

$$\{f \in \star(\mathcal{F}, 0) : Ef^2 \le r\} \subset \{f \in \star(\mathcal{F}, 0) : E_m f^2 \le 2r\}.$$

2. By assumption $(\forall f \in \mathcal{F})\ \|f\|_{L_\infty} \le 1$ and this implies $R_m\{f \in \star(\mathcal{F}, 0) : Ef^2 \le r\} \le 1$.

Combining the observations, we can bound

$$E R_m\{f \in \star(\mathcal{F}, 0) : Ef^2 \le r\} \le \frac{1}{m} + E R_m\{f \in \star(\mathcal{F}, 0) : E_m f^2 \le 2r\}.$$

Then $\psi(r) \le 10\left(\frac{1}{m} + E R_m\{f \in \star(\mathcal{F}, 0) : E_m f^2 \le 2r\}\right) + \frac{11 \log(m)}{m}$, and in particular

$$\psi(r^*) \le 10\left(\frac{1}{m} + E R_m\{f \in \star(\mathcal{F}, 0) : E_m f^2 \le 2r^*\}\right) + \frac{11 \log(m)}{m}.$$



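We denote $\nu_m$ the empirical measure induced by the $m$ samples (whose expectation is $E_m$). Under the metric $L_2(\nu_m)$, the set $\{f \in \star(\mathcal{F}, 0) : E_m f^2 \le 2r\}$ is covered by a single ball of radius $\sqrt{2r}$ around the zero function. Applying Lemma 22 we have

$$R_m\{f \in \star(\mathcal{F}, 0) : E_m f^2 \le 2r\} \le 12 \int_0^{\sqrt{2r}} \sqrt{\frac{\log N(\epsilon, \star(\mathcal{F}, 0), L_2(\nu_m))}{m}}\, d\epsilon.$$

Since $(\forall f \in \mathcal{F})\ \|f\|_{L_2(\nu_m)} \le 1$, an $\epsilon$ cover of $\mathcal{F}$ can be converted into an $\epsilon$ cover of $\star(\mathcal{F}, 0)$ by replacing each element by $\lceil 1/\epsilon \rceil + 1$ balls on the segment from it to 0. Then $N(\epsilon, \star(\mathcal{F}, 0), L_2(\nu_m)) \le 2N(\epsilon, \mathcal{F}, L_2(\nu_m))/\epsilon$ (a). Using also the assumption on $C$ (b) and Lemma 23 (c), we find

$$\begin{aligned}
12 \int_0^{\sqrt{2r}} \sqrt{\frac{\log N(\epsilon, \star(\mathcal{F}, 0), L_2(\nu_m))}{m}}\, d\epsilon
&\overset{(a)}{\le} 12 \int_0^{\sqrt{2r}} \sqrt{\frac{\log\left(2N(\epsilon, \mathcal{F}, L_2(\nu_m))/\epsilon\right)}{m}}\, d\epsilon \\
&\overset{(b)}{\le} 12 \int_0^{\sqrt{2r}} \sqrt{\frac{\log\left((C/\epsilon)^{d+1}\right)}{m}}\, d\epsilon \\
&\overset{(c)}{\le} 24 \sqrt{\frac{2r(d+1)\log\left(C/\sqrt{2r}\right)}{m}}.
\end{aligned}$$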

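Substituting into $\psi(r^*)$, we obtain that

$$\begin{aligned}
r^* &\le 10\left(\frac{1}{m} + 24\sqrt{\frac{2r^*(d+1)\log\left(C/\sqrt{2r^*}\right)}{m}}\right) + \frac{11\log(m)}{m} \\
&= 240\sqrt{\frac{2r^*(d+1)\log\left(C/\sqrt{2r^*}\right)}{m}} + \frac{10 + 11\log(m)}{m}.
\end{aligned}$$

Let $\alpha > 0$ be fixed. If $r^* \le \alpha C^2/(2m)$, our first step is complete. If not, then

$$r^* > \alpha C^2/(2m) \iff \sqrt{m/\alpha} > C/\sqrt{2r^*},$$

and then

$$\begin{aligned}
r^* &\le 240\sqrt{\frac{2r^*(d+1)\log\left(\sqrt{m/\alpha}\right)}{m}} + \frac{10 + 11\log(m)}{m} \\
&= 240\sqrt{\frac{r^*(d+1)\log(m/\alpha)}{m}} + \frac{10 + 11\log(m)}{m} \\
&\le 2\max\left\{240\sqrt{\frac{r^*(d+1)\log(m/\alpha)}{m}},\ \frac{10 + 11\log(m)}{m}\right\}.
\end{aligned}$$

Then either $r^* \le (20 + 22\log(m))/m$ (and the first step is complete), or

$$r^* \le 480\sqrt{\frac{r^*(d+1)\log(m/\alpha)}{m}} \iff r^* \le (480)^2\,\frac{(d+1)\log(m/\alpha)}{m}$$

and again we are done. We conclude that

$$r^* \le \max\left\{\frac{\alpha C^2}{2m},\ (480)^2\,\frac{(d+1)\log(m/\alpha)}{m},\ \frac{20 + 22\log(m)}{m}\right\}.$$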
Having proved $r^*$ decays approximately as $1/m$, we apply Theorem 3.3 of Bartlett et al. (2005), with $a = 0$; $b = 1$; $B = 1$; $Tf = Ef^2$. By definition of $T$, it is clear that

$$\begin{aligned}
\psi(r) &= 10\, E R_m\{f \in \star(\mathcal{F}, 0) \mid Ef^2 \le r\} + \frac{11\log m}{m} \\
&\ge E R_m\{f \in \star(\mathcal{F}, 0) \mid Ef^2 \le r\} \\
&= E R_m\{f \in \star(\mathcal{F}, 0) \mid Tf \le r\}
\end{aligned}$$

holds, then we can use part 2 of Theorem 3.3 of Bartlett et al. (2005), which allows the conclusion that for all $f \in \mathcal{F}$, $Ef \le \frac{K}{K-1} E_m f + 6Kr^* + \frac{11x + 5K}{m}$. □
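
To get a feeling for the constants, the bound on the fixed point $r^*$ can be evaluated directly. The sketch below assumes the three-term maximum derived in this proof; the function name `fixed_point_bound` and the example values of `d`, `C` and `alpha` are illustrative.

```python
import math

def fixed_point_bound(m, d, C, alpha=1.0):
    """Upper bound on r*: the maximum of the three cases in the proof."""
    return max(
        alpha * C ** 2 / (2.0 * m),
        480 ** 2 * (d + 1) * math.log(m / alpha) / m,
        (20.0 + 22.0 * math.log(m)) / m,
    )

# Illustrative values only: d = 10 and C = 2 are assumptions.
bound_small_m = fixed_point_bound(10**6, d=10, C=2.0)
bound_large_m = fixed_point_bound(10**10, d=10, C=2.0)
```

For these values the middle term dominates, and the bound only drops below 1 (the range of the functions) for very large $m$, which reflects the remark that no effort was made to tighten the constants; the rate in $m$ is nevertheless $\log(m)/m$.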